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Most results in nonparametric regression theory are developed 
only for the case of additive noise. In such a setting many smoothing 
techniques including wavelet thresholding methods have been devel- 
oped and shown to be highly adaptive. In this paper we consider 
nonparametric regression in exponential families with the main fo- 
cus on the natural exponential families with a quadratic variance 
function, which include, for example, Poisson regression, binomial 
regression and gamma regression. We propose a unified approach of 
using a mean-matching variance stabilizing transformation to turn 
the relatively complicated problem of nonparametric regression in ex- 
ponential families into a standard homoscedastic Gaussian regression 
problem. Then in principle any good nonparametric Gaussian regres- 
sion procedure can be applied to the transformed data. To illustrate 
our general methodology, in this paper we use wavelet block thresh- 
olding to construct the final estimators of the regression function. 
The procedures are easily implementable. Both theoretical and nu- 
merical properties of the estimators are investigated. The estimators 
are shown to enjoy a high degree of adaptivity and spatial adaptiv- 
ity with near-optimal asymptotic performance over a wide range of 
Besov spaces. The estimators also perform well numerically. 

1. Introduction. Theory and methodology for nonparametric regression 
is now well developed for the case of additive noise particularly additive 
homoscedastic Gaussian noise. In such a setting many smoothing techniques 
including wavelet thresholding methods have been developed and shown 
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to be adaptive and enjoy other desirable properties over a wide range of 
function spaces. However, in many applications the noise is not additive and 
the conventional methods are not readily applicable. For example, such is 
the case when the data are counts or proportions. 

In this paper we consider nonparametric regression in exponential fami- 
lies with the main focus on the natural exponential families with a quadratic 
variance function (NEF-QVF). These include, for example, Poisson regres- 
sion, binomial regression and gamma regression. We present a unified treat- 
ment of these regression problems by using a mean-matching variance stabi- 
lizing transformation (VST) approach. The mean-matching VST turns the 
relatively complicated problem of regression in exponential families into a 
standard homoscedastic Gaussian regression problem and then any good 
nonparametric Gaussian regression procedure can be applied. 

Variance stabilizing transformations and closely related normalizing trans- 
formations have been widely used in many parametric statistical inference 
problems. See Hoyle (1973), Efron (1982) and Bar-Lev and Enis (1990). 
In the more standard parametric problems, the goal of VST is often to 
optimally stabilize the variance. That is, one desires the variance of the 
transformed variable to be as close to a constant as possible. For example, 
Anscombe (1948) introduced VSTs for binomial, Poisson and negative bi- 
nomial distributions that provide the greatest asymptotic control over the 
variance of the resulting transformed variables. In the context of nonpara- 
metric function estimation, Anscombe's variance stabilizing transformation 
has also been briefly discussed in Donoho (1993) for density estimation. 
However, for our purposes it is much more essential to have optimal asymp- 
totic control over the bias of the transformed variables. A mean-matching 
VST minimizes the bias of the transformed data while also stabilizing the 
variance. 

Our procedure begins by grouping the data into many small size bins, 
and by then applying the mean-matching VST to the binned data. In prin- 
ciple any good Gaussian regression procedure could be applied to the trans- 
formed data to construct the final estimator of the regression function. To 
illustrate our general methodology, in this paper we employ two wavelet 
block thresholding procedures. Wavelet thresholding methods have achieved 
considerable success in nonparametric regression in terms of spatial adaptiv- 
ity and asymptotic optimality. In particular, block thresholding rules have 
been shown to possess impressive properties. In the context of nonparamet- 
ric regression, local block thresholding has been studied, for example, in 
Hall, Kerkyacharian and Picard (1998), Cai (1999, 2002) and Cai and Sil- 
verman (2001). In this paper we shall use the BlockJS procedure proposed 
in Cai (1999) and the NeighCoeff procedure introduced in Cai and Silver- 
man (2001). Both estimators were originally developed for nonparametric 
Gaussian regression. BlockJS first divides the empirical coefficients at each 
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resolution level into nonoverlapping blocks and then simultaneously esti- 
mates all the coefficients within a block by a James-Stein rule. NeighCoeff 
also thresholds the empirical coefficients in blocks, but estimates wavelet 
coefficients individually. It chooses a threshold for each coefficient by refer- 
encing not only to that coefficient but also to its neighbors. Both estimators 
increase estimation accuracy over term-by-term thresholding by utilizing 
information about neighboring coefficients. 

Both theoretical and numerical properties of our estimators are investi- 
gated. It is shown that the estimators enjoy excellent asymptotic adaptivity 
and spatial adaptivity. The procedure using BlockJS simultaneously attains 
the optimal rate of convergence under the integrated squared error over a 
wide range of the Besov classes. The estimators also automatically adapt 
to the local smoothness of the underlying function; they attain the local 
adaptive minimax rate for estimating functions at a point. A key step in the 
technical argument is the use of the quantile coupling inequality of Komlos, 
Major and Tusnady (1975) to approximate the binned and transformed data 
by independent normal variables. The procedures are easy to implement, at 
the computational cost of 0(n). In addition to enjoying the desirable theo- 
retical properties, the procedures also perform well numerically. 

Our method is applicable in more general settings. It can be extended to 
treat nonparametric regression in general one-parameter natural exponential 
families. The mean-matching VST only exists in NEF-QVF (see Section 2). 
In the general case when the variance is not a quadratic function of the 
mean, we apply the same procedure with the standard VST in place of the 
mean-matching VST. It is shown that, under slightly stronger conditions, the 
same optimality results hold in general. We also note that mean-matching 
VST transformations exist for some useful nonexponential families, including 
some commonly used for modeling "over-dispersed" data. Though we do not 
pursue the details in the present paper, it appears that because of this our 
methods can also be effectively used for nonparametric regressions involving 
such error distributions. 

We should note that nonparametric regression in exponential families has 
been considered in the literature. Among individual exponential families, 
the Poisson case is perhaps the most studied. Besbeas, De Feis and Sapati- 
nas (2004) provided a review of the literature on the nonparametric Pois- 
son regression and carried out an extensive numerical comparison of several 
estimation procedures including Donoho (1993), Kolaczyk (1999a, 1999b) 
and Fryzlewicz and Nason (2001). In the case of Bernoulli regression, Anto- 
niadis and Leblanc (2000) introduced a wavelet procedure based on diagonal 
linear shrinkers. Unified treatments for nonparametric regression in expo- 
nential families have also been proposed. Antoniadis and Sapatinas (2001) 
introduced a wavelet shrinkage and modulation method for regression in 
NEF-QVF and showed that the estimator attains the optimal rate over the 
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classical Sobolev spaces. Kolaczyk and Nowak (2005) proposed a recursive 
partition and complexity-penalized likelihood method. The estimator was 
shown to be within a logarithmic factor of the minimax rate under squared 
Hellinger loss over Besov spaces. 

The paper is organized as follows. Section 2 discusses the mean- matching 
variance stabilizing transformation for natural exponential families. In Sec- 
tion 3, we first introduce the general approach of using the mean-matching 
VST to convert nonparametric regression in exponential families into a non- 
parametric Gaussian regression problem, and then present in detail specific 
estimation procedures based on the mean-matching VST and wavelet block 
thresholding. Theoretical properties of the procedures are treated in Section 
4. Section 5 investigates the numerical performance of the estimators. We 
also illustrate our estimation procedures in the analysis of two real data sets: 
a gamma-ray burst data set and a packet loss data set. Technical proofs are 
given in Section 6. 

2. Mean-matching variance stabilizing transformation. We begin by con- 
sidering variance stabilizing transformations (VST) for natural exponential 
families. As mentioned in the Introduction, VST has been widely used in 
many contexts and the conventional goal of VST is to optimally stabilize 
the variance. See, for example, Anscombe (1948) and Hoyle (1973). For our 
purpose of nonparametric regression in exponential families, we shall first 
develop a new class of VSTs, called mean-matching VSTs, which asymptot- 
ically minimize the bias of the transformed variables while at the same time 
stabilizing the variance. 

Let X\ , X2 , . . . , X m be a random sample from a distribution in a nat- 
ural one-parameter exponential families with the probability density/mass 
function 

q(x\r ] )=e r > x -^ r i ) h(x). 

Here rj is called the natural parameter. The mean and variance are, respec- 
tively, 

H(rj) = ip'(rj) and cr 2 (i]) = ip"(rj). 

We shall denote the distribution by NEF(//). A special subclass of interest 
is the one with a quadratic variance function (QVF), 

(1) a 2 = V(fj.) = a + aifi + a 2 /" 2 - 

In this case we shall write X{ ~ NQ(^). The NEF-QVF families consist of 
six distributions, three continuous: normal, gamma and NEF-GHS distribu- 
tions and three discrete: binomial, negative binomial and Poisson. See, for 
example, Morris (1982) and Brown (1986). 
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Set X = Xi. According to the central limit theorem, 

y/m(X/m — fi(r))) — > N(0, V(n(rj))) as m — > oo. 

A variance stabilizing transformation (VST) is a function G : R — )■ R such 
that 

(2) G>(n) = V- l l 2 (ri. 
The standard delta method then yields 

V^{G(X/m) - G(n(j}))} A N(0, 1). 

It is known that the variance stabilizing properties can often be further 
improved by using a transformation of the form 

(3) H m (X ) = G(§±£) 

with suitable choice of constants a and b. See, for example, Anscombe (1948). 
In this paper we shall use the VST as a tool for nonparametric regression 
in exponential families. For this purpose, it is more important to optimally 
match the means than to optimally stabilize the variance. That is, we wish 
to choose the constants a and b such that K{H m (X)} optimally matches 

To derive the optimal choice of a and b, we need the following expansions 
for the mean and variance of the transformed variable H m (X). 



Lemma 1. Let be a compact set in the interior of the natural param- 
eter space. Then for n G and for constants a and b, 

(4) E{H m (X)} - = ( a " bKv) ~ " m' 1 + O(m^) 
and 

(5) Var{tf m (A)} = - + 0(m~ 2 ). 

m 

Moreover, there exist constants a and b such that 

(6) E j G (^_±^ | _ G ( M(r?) ) = (m^) 

for all rj € with a positive Lebesgue measure if and only if the exponential 
family has a quadratic variance function. 
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The proof of Lemma 1 is given in Section 6. The last part of Lemma 1 
can be easily explained as follows. Equation (4) implies that (6) holds if and 
only if 

that is, n"(r]) = Aa^'{rj) — 4bfj,(rj)fjf(rj). Solving this differential equation 
yields 

(7) a 2 {r,) = M '(r/) = a + 4a / u(r ? ) - 2b fJ , 2 (r ) ) 

for some constant ao- Hence the solution of the differential equation is exactly 
the subclass of natural exponential family with a quadratic variance function 
(QVF). 

It follows from (7) that among the VSTs of the form (3) for the exponential 
family with a quadratic variance function 

a 2 = a + a\fi + a 2 /x 2 , 

the best constants a and b for mean-matching are 

(8) a = \a\ and b = —\ol2- 

We shall call the VST (3) with the constants a and b given in (8) the 
mean-matching VST. Lemma 1 shows that the mean-matching VST only 
exists in the NEF-QVF families and with the mean-matching VST the bias 
E{G(^±f )}- G(n(r})) is of the order (to" 2 ). In contrast, for an NEF without 

a quadratic variance function, the term a — ^(77)6 — ^r^y does not vanish 
for all 77 with any choice of a and b. And in this case the bias 

E { G (§Ti)}- G ^)) = °( m " 1 ) 

instead of 0(m~ 2 ) in (6). We shall see in Section 4 that this difference has 
important implications for nonparametric regression in NEF. 

The following are the specific expressions of the mean-matching VST H m 
for the five distributions (other than normal) in the NEF-QVF families: 

• Poisson: a = 1/4, b = and H m (X) = 2*J(X + \)/m. 

• Binomial(r,p): a= 1/4, b= ^ and H m (X) = arcsin ( J f^[^ 2 ) ■ 

• Negative Binomial(r,p): a = l/4, b = —-^p and 
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Gamma(r, A) (with r known): a = 0, b = — ^ and H m (X) = \fr ^( ^^ l' 
NEF-GHS(r, A) (with r known): a = 0, b = -± and 



H m (X) = ^\J ^— + Jl + 



rm— 1/2 y {mr — 1/2) 2 / 

Note that the mean-matching VST is different from the more conventional 
VST that optimally stabilizes the variance. Take the binomial distribution 
with r = 1 as an example. In this case the VST is an arcsine transformation. 

Let Xi, . . . ,X m Bernoulli(p) and then X = J^^Li ~ Binomial(m,p). 
Figure 1 compares the mean and variance of three arcsine transformations 
of the form 




for the binomial variable X with m = 30. The choice of c = gives the usual 
arcsine transformation, c = 3/8 optimally stabilizes the variance asymptoti- 
cally, and c = 1/4 yields the mean-matching arcsine transformation. The left 
panel of Figure 1 plots the bias 



?7i(E p arcsin(-\/ {X + c)/(m + 2c)) — arcsin(- v /p)) 

as a function of p for c = 0, c = \ and c = |. It is clear from the plot that 
c = \ is the best choice among the three for matching the mean. On the 
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other hand, the arcsine transformation with c = yields significant bias and 
the transformation with c = | also produces noticeably larger bias. The right 
panel plots the variance of \pm arcsin( yJ{X + c) /(m + 2c) ) for c = 0, c = \ 
and c = | . Interestingly, over a wide range of values of p near the center the 
arcsine transformation with c = j is even slightly better than the case with 
c = | and clearly c = is the worst choice of the three. Figure 2 below shows 
similar behavior for the Poisson case. 

Let us now consider the Gamma distribution with r = 1 as an example 
for the continuous case. The VST in this case is a log transformation. Let 

X\, . . . , X m Exponential(A). Then X = Y^dLi Xi ~ Gamma(m, A). Figure 
3 compares the mean and variance of two log transformations of the form 

(9) 

\m — c ) 

for the Gamma variable X with A = 1 and m ranging from 3 to 40. The 
choice of c = gives the usual log transformation, and c = 1/2 yields the 
mean-matching log transformation. The left panel of Figure 3 plots the bias 
as a function of m for c = and c=\. It is clear from the plot that c = \ 
is a much better choice than c = for matching the mean. It is interesting 
to note that in this case there do not exist constants a and b that optimally 
stabilize the variance. The right panel plots the variance of y / mln(X), that 
is, c = 0, as a function of m. In this case, it is obvious that the variances are 
the same with c = and c = 1/2 for the variable in (9). 




Fig. 2. Comparison of the mean (left panel) and variance (right panel) of the root trans- 
formations for Poisson(A) with c = (solid line), c= \ (+ line) and c=| (dashed line). 
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Fig. 3. Comparison of the mean (left panel) and variance (right panel) of the log trans- 
formations for Gamma(m,A) with c = (solid line) and c=\ (+ line). 



Remark 1. Mean- matching variance stabilizing transformations exist 
for some other important families of distributions. We mention two that 
are commonly used to model "over-dispersed" data. The first family is of- 
ten referred to as the gamma-Poisson family. See, for example, Johnson, 
Kemp and Kotz (2005), Berk and MacDonald (2008) and Hilbe (2007). Let 

Xi\Zi ~ Poisson(Zj) with Zi ^ Gamma(a,o"), i = 1, . . . ,m. The Zi are la- 
tent variables; only the JQ are observed. The scale parameter, a, is assumed 
known, and the mean /x = aa is the unknown parameter, < fi < oo. The 
resulting family of distributions of each X{ is a subfamily of the negative 
Binomial (r,p) with p = (1 + o") , a fixed constant, and r = fi/cr. [Here 
this negative Binomial family is defined for all r > as having probability 
function, P(k) = T(k + r)p r (l - p) k /T(k + l)r(r), k = 0, 1, . .. .] This is a 
one-parameter family, but it is not an exponential family. It can be verified 
that a mean-matching variance stabilizing transformation for this family is 
given by 



V m 4m 

This transformation has the desired properties (5) and (6) with G(pt) = 
2y / 7i. For the second family, consider the beta-binomial family. See Johnson, 

Kemp and Kotz (2005). Here, Xi\Zi '~ Binomial(r, Zi) and ~* Beta(a,6), 
i = 1, . . . ,m. Again, the Zi are latent variables; only the Xj are observed. 
For the family of interest here, we assume a, b are allowed to vary so that 
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a + b = k, a known constant, and < fi = a/ {a + b) < 1. This family can al- 
ternatively be parameterized via fx, a = — fi)/(k + l). The resulting one- 
parameter family of distributions of each Xi is again not a one-parameter 
exponential family. It can be verified that a mean-matching variance stabi- 
lizing transformation for this family is given by 

Y = H m (X) = 2^arcsin J X + (<T + 1)/4 

V rm + (cr + lj/2 

This transformation has the desired properties (5) and (6) with G{(i) = 2 x 
arcsin yfji. 

3. Nonparametric regression in exponential families. We now turn to 
nonparametric regression in exponential families. We begin with the NEF- 
QVF. Suppose we observe 

(10) Yi ~ d NQ(/(ii)), i = l,...,n,ti = -, 

n 

and wish to estimate the mean function f(t). In this setting, for the five 
NEF-QVF families discussed in the last section the noise is not additive 
and non-Gaussian. Applying standard nonparametric regression methods 
directly to the data {Yi} in general do not yield desirable results. Our strat- 
egy is to use the mean-matching VST to reduce this problem to a standard 
Gaussian regression problem based on a sample {Yj : j = 1, . . . , T} where 

% ~ NiGtffa)),™- 1 ), t j= j/T,j = 1,2,..., T. 

Here G is the VST defined in (2), T is the number of bins, and m is the 
number of observations in each bin. The values of T and m will be specified 
later. 

We begin by dividing the interval into T equi- length subintervals with m = 
n/T observations in each subintervals. Let Qj be the sum of observations 
on the jth subinterval Ij = [^-, ^), j = 1, 2, . . . ,T, 

jm 

(11) Q j= YI Y *- 

i=(J— l)m+l 

The sums {Qj} can be treated as observations for a Gaussian regression di- 
rectly, but this in general leads to a heteroscedastic problem. Instead, we ap- 
ply the mean-matching VST discussed in Section 2, and then treat H m (Qj) 
as new observations in a homoscedastic Gaussian regression problem. To be 
more specific, let 

(12) Y*=H m (Q j ) = G^±^j ) j = l,...,T, 
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where the constants a and b are chosen as in (8) to match the means. The 
transformed data Y* = (Y* , . . . , Y£ ) is then treated as the new equi-spaced 
sample for a nonparametric Gaussian regression problem. 

We will first estimate G(f(ti)), then take a transformation of the estima- 
tor to estimate the mean function /. After the original regression problem is 
turned into a Gaussian regression problem through binning and the mean- 
matching VST, in principle any good nonparametric Gaussian regression 
method can be applied to the transformed data {Y?} to construct an esti- 
mate of G(/(-)). The general ideas for our approach can be summarized as 
follows. 

1. Binning: divide {1^} into T equal length intervals between and 1. Let 
Q11Q21 ■ ■ ■ ,Qt be the sum of the observations in each of the intervals. 
Later results suggest a choice of T satisfying T x n 3 / 4 for the NEF-QVF 
case and T x n 1 / 2 for the non-QVF case. See Section 4 for details. 

2. VST: let Y* = H m (Qj), j = 1,...,T, and treat Y* = (Y* ,Y* , . . . ,Y*) 
as the new equi-spaced sample for a nonparametric Gaussian regression 
problem. 

3. Gaussian regression: apply your favorite nonparametric regression proce- 
dure to the binned and transformed data Y* to obtain an estimate G(/) 
ofG(/). ^ 

4. Inverse VST: estimate the mean function / by / = G~ 1 (G(/)). If G(/) 
is not in the domain of G -1 which is an interval between a and b (a and b 

can be 00), we set G^GC/)) = G~V) if G(f) < a and set G'^Gif)) = 
G~ l (b) if G(/) > b. For example, G _1 (a) = when a < in the case of 
negative Binomial and NEF-GHS distributions. 

3.1. Effects of binning and VST. As mentioned earlier, after binning and 
the mean-matching VST, one can treat the transformed data {Y?} as if they 
were data from a homoscedastic Gaussian nonparametric regression prob- 
lem. A key step in understanding why this procedure works is to understand 
the effects of binning and the VST. Quantile coupling provides an important 
technical tool to shed insights on the procedure. 

The following result, which is a direct consequence of the quantile coupling 
inequality of Komlos, Major and Tusnady (1975), shows that the binned 
and transformed data can be well approximated by independent normal 
variables. 

Lemma 2. Let Xi NQ(/i) with variance V for i = 1, .. . ,m and let 
X = Y27Li-^i- Under the assumptions of Lemma 1, there exists a standard 
normal random variable Z~ A^(0, 1) and constants ci, 02,03 > not depend- 
ing on m such that whenever the event A = {\X — m/x| < c\m} occurs, 

(13) \X - miM - y/mV Z\ <c 2 Z 2 + c 3 . 
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Hence, for large m, X can be treated as a normal random variable with 
mean m\x and variance mV. Let Y = H m (X) = G(Sf ), e = ~EY — G(fx) and 
Z be a standard normal variable satisfying (13). Then Y can be written as 

(14) Y = G{^)+e + rrT l l 2 Z + i, 
where 

(15) e = G(|±f)-G( M )-6-m- 1 /«Z. 

In the decomposition (14), e is the deterministic approximation error be- 
tween the mean of Y and its target value G(fx) and £ is the stochastic error 
measuring the difference of Y and its normal approximation. It follows from 
Lemma 1 that when m is large, e is "small," |e| < cm~ 2 for some constant 
c > 0. The following result, which is proved in Section 6.1, shows that the 
random variable £ is "stochastically small." 

Lemma 3. Let NQ(/i) variance V for i = 1, . . . , ra, and X = 

-Xj • £ef Z be the standard normal variable given as in Lemma 2 and 
let £ be given as in (15). Then for any integer k>l there exists a constant 
Ck > such that for all A > 1 and all a > 0, 

(16) E\£\ k <C k m~ k and P(|£| > a) < C k (a m y k . 

The discussion so far has focused on the effects of the VST for i.i.d. obser- 
vations. In the nonpar ametric function estimation problem mentioned ear- 
lier, observations in each bin are independent but not identically distributed 
since the mean function / is not a constant in general. However, through cou- 
pling, observations in each bin can in fact be treated as if they were i.i.d. ran- 
dom variables when the function / is smooth. Let Xj ~ NQ(/ij), i = 1, . . . , m, 
be independent. Here the means \i{ are "close" but not equal. Let \i be a 
value close to the /V s - The analysis given in Section 6.1 shows that X; can 

in fact be coupled with i.i.d. random variables X jC where Xi C NQ(/i). 
See Lemma 4 in Section 6.1 for a precise statement. 

How well the transformed data {Y?} can be approximated by an ideal 
Gaussian regression model depends partly on the smoothness of the mean 
function /. For < d < 1, define the Lipschitz class A d (M) by 

A d (M) = {/ : \f(h) - f{t 2 )\ < M\h - t 2 \ d <h,t 2 <l} 

and 

F d (M, e,v) = {f:fe A d (M), f{t) E [e, v},for all x € [0, 1]}, 

where [e, v] with e < v is a compact set in the interior of the mean parameter 
space of the natural exponential family. Lemmas 1, 2, 3 and 4 together yield 
the following result which shows how far away are the transformed data 
{1^*} from the ideal Gaussian model. 
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Theorem 1. LetYf =G(=$£) be given as in (12) and let f G F d (M,e,v). 
Then Y? can be written as 

(17) Y? = G(f(±y\+ ej + m -V*Z j + Z j , j = l,2,...,T, 

where Zj *~ ' N(0, 1), €j are constants satisfying \ej\ < c(m~ 2 + T~ d ) and 
consequently for some constant C > 

(18) l£ e 2< C(m -4 + r -2 d) 

and £j are independent and "stochastically small" random variables satisfy- 
ing that for any integer k > and any constant a > 

m^\ k <C k \og 2k m-(m~ k + T~ dk ) and 

(19) 

F(\tj\>a)<C k \og 2k m-(m- k + T- dk )a- k , 
where > is a constant depending only on k,d and M . 

Theorem 1 provides explicit bounds for both the deterministic and stochas- 
tic errors. This is an important technical result which serves as a major tool 
for the proof of the main results given in Section 4. 

Remark 2. There is a tradeoff between the two terms in the bound (18) 
for the overall approximation error ^ Ylj=i e j- There are two sources to the 
approximation error: one is the variation of the functional values within a 
bin and one comes from the expansion of the mean of Y? (see Lemma 1). The 
former is related to the smoothness of the function / and is controlled by the 
T~ 2d term and the latter is bounded by the m -4 term. In addition, there 
is the discretization error between the sampled function {G(f(j/T)):j = 
1, ...,T} and the whole function G(f(t)), which is obviously a decreasing 
function of T. Furthermore, the choice of T also affects the stochastic error 
£. A good choice of T makes all three types of errors negligible relative to 
the minimax risk. See Section 4.2 for further discussions. 

Remark 3. In Section 4 we introduce Besov balls Bp q (M) for the anal- 
ysis of wavelet regression methods. A Besov ball Bp q (M) can be embedded 
into a Lipschitz class A d (M') with d = min(a — 1/p, 1) and some M' > 0. 

Although the main focus of this paper is on the NEF-QEF, our method 
of binning and VST can be extended to the general one-parameter NEF. 
This extension is discussed in Section 4.1 where a version of Theorem 1 for 
the standard VST is developed in the general case. 
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3.2. Wavelet thresholding. One can apply any good nonparametric Gaus- 
sian regression procedure to the transformed data {Y?} to construct an esti- 
mator of the function /. To illustrate our general methodology, in the present 
paper we shall use wavelet block thresholding to construct the final estima- 
tors of the regression function. Before we can give a detailed description of 
our procedures, we need a brief review of basic notation and definitions. 

Let {</>, ijj} be a pair of father and mother wavelets. The functions <f> and 
ip are assumed to be compactly supported and J cj) = 1, and dilation and 
translation of 4> and ip generates an orthonormal wavelet basis. For simplicity 
in exposition, in the present paper we work with periodized wavelet bases 
on [0,1]. Let 

oo oo 
l=— oo l=— oo 

where <j>j,k(t) = 2 j / 2 (f)(2 j t - k) and ipj,k{t) = 2 j / 2 ip(2 j t - k). The collection 
{<$j fc, k = 1, . . . , 2 J0 ; i/ij k ,j > jo > 0, k = 1, . . . , 2 J } is then an orthonormal 
basis of L 2 [0,1], provided the primary resolution level jo is large enough 
to ensure that the support of the scaling functions and wavelets at level jo 
is not the whole of [0, 1] . The superscript "p" will be suppressed from the 
notation for convenience. An orthonormal wavelet basis has an associated 
orthogonal Discrete Wavelet Transform (DWT) which transforms sampled 
data into the wavelet coefficients. See Daubechies (1992) and Strang (1992) 
for further details about the wavelets and discrete wavelet transform. A 
square-integrable function / on [0, 1] can be expanded into a wavelet series: 

2 j o oo V 

(20) /(*) = £ 

k=l j=jo k=l 

where 9j t k = (f,(f>j,k),Qj,k = if-i^j.k) are the wavelet coefficients of /. 

3.3. Wavelet procedures for generalized regression. We now give a de- 
tailed description of the wavelet thresholding procedures BlockJS and Neigh- 
Coeff in this section and study the properties of the resulting estimators in 
Section 4. We shall show that our estimators enjoy a high degree of adap- 
tivity and spatial adaptivity and are easily implementable. 

Apply the discrete wavelet transform to the binned and transformed data 
Y* , and let U = T~ 1 / 2 WY* be the empirical wavelet coefficients, where W 
is the discrete wavelet transformation matrix. Write 

(21) U = (y jo ,i, y j0t y , y j0t i, y jo ^o Vj-i,i, i/j-i^-O'- 

Here yj Q) k are the gross structure terms at the lowest resolution level, 
and yj : k (j = jo, . . . , J — 1, k = 1, . . . , 2 J ) are empirical wavelet coefficients 
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at level j which represent fine structure at scale 2 J . The empirical wavelet 
coefficients can then be written as 

(22) y 3) k = Oj,k + £j,k + ~~/= z j,k + £j,ki 



n 

where 6, k are the true wavelet coefficients of G(f), e,- ^ are "small" determin- 
istic approximation errors, Zj t k are i.i.d. N{0, 1), and are some "small" 
stochastic errors. The theoretical calculations given in Section 6 will show 
that both €j t k and & are negligible. If these negligible errors are ignored 
then we have 

(23) yj t k ~ 9j,k H — 7= z j,k, 



which is the idealized Gaussian sequence model with noise level a = 1/ y/n. 
Both BlockJS [Cai (1999)] and NeighCoeff [Cai and Silverman (2001)] were 
originally developed for this ideal model. Here we shall apply these methods 
to the empirical coefficients yj^ as if they were observed as in (23). 

We first describe the BlockJS procedure. At each resolution level j, the 
empirical wavelet coefficients yj^ are grouped into nonoverlapping blocks of 
length L. As in the sequence estimation setting let B % - = {(j, k):{i — 1)L + 1 < 
k < iL} and let 5| j = fyeBi Uj fc- A modified James-Stein shrinkage rule 
is then applied to each block B 1 -, that is, 

(24) e j>k =(l-^-) Vj,k for (j,k) € Bj, 

V na j,i / + 

where A* = 4.50524 is the solution to the equation A* — log A* = 3 [see Cai 
(1999) for details], and - is approximately the variance of each y^k- F° r the 

gross structure terms at the lowest resolution level jo, we set 6 j 0i k = yj ,k- The 
estimate of G(f(-)) at the equally spaced sample points {y :i = 1, . . . ,T} is 
then obtained by applying the inverse discrete wavelet transform (IDWT) 
to the denoised wavelet coefficients. That is, {G(/(^)) :i = 1, . . . , T} is es- 
timated by &lj) = {G(f(f)):i = l,...,T} with G(f) = T l ' 2 W- 1 ■ 9. The 
estimate of the whole function G(f) is given by 

2^0 ^ J-l 2 j 

k=i j=j k=i 

The mean function / is estimated by 

(25) fBj S (t) = G- 1 (G(f(t)))- 

Figure 4 shows the steps of the procedure for an example in the case of 
nonparametric Gamma regression. 
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Fig. 4. An example of nonparametric Gamma regression using the mean-matching VST 
and wavelet block thresholding. 



We now turn to the NeighCoeff procedure. This procedure, introduced in 
Cai and Silverman (2001) for Gaussian regression, incorporates information 
about neighboring coefficients in a different way from the BlockJS procedure. 
NeighCoeff also thresholds the empirical coefficients in blocks, but estimates 
wavelet coefficients individually. It chooses a threshold for each coefficient 
by referencing not only to that coefficient but also to its neighbors. As shown 
in Cai and Silverman (2001), NeighCoeff outperforms BlockJS numerically, 
but with slightly inferior asymptotic properties. 

Let the empirical coefficients {yj,k} be given same as before. To estimate a 
coefficient 6j t k at resolution level j, we form a block of size 3 by including the 
coefficient yj\. together with its immediate neighbors Vj k—i an d Uj,k+i- (If 
periodic boundary conditions are not being used, then for the two coefficients 
at the boundary blocks, again of length 3, are formed by only extending in 
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one direction.) Estimate the coefficient 8j k by 

/ 21ogn\ 

(26) ^, fc =11- — Vj,k, 

V n,5 j,fc / + 

where S? k = Uj k _i + + vjk+v The g ross structure terms at the lowest 

resolution level are again estimated by 9j 0i k = Vjo,k- The rest of the steps are 
same as before. Namely, the inverse DWT is applied to obtain an estimate 

G(f) and the mean function / is then estimated by /nc(^) = G~ 1 (G(f(t))). 

We can envision a sliding window of size 3 which moves one position 
each time and only the middle coefficient in the center is estimated for a 
given window. Each individual coefficient is thus shrunk by an amount that 
depends on the coefficient and on its immediate neighbors. Note that Neigh- 
Coeff uses a lower threshold level than the universal thresholding procedure 
of Donoho and Johnstone (1994). In NeighCoeff, a coefficient is estimated 
by zero only when the sum of squares of the empirical coefficient and its 
immediate neighbors is less than 2<r 2 logn, or the average of the squares is 
less than | a 2 log n. 

4. Theoretical properties. In this section we investigate the asymptotic 
properties of the procedures proposed in Section 3. Numerical results will 
be given in Section 5. 

We study the theoretical properties of our procedures over the Besov 
spaces that are by now standard for the analysis of wavelet regression meth- 
ods. Besov spaces are a very rich class of function spaces and contain as spe- 
cial cases many traditional smoothness spaces such as Holder and Sobolev 
spaces. Roughly speaking, the Besov space contains functions having a 
bounded derivatives in LP norm, the third parameter q gives a finer grada- 
tion of smoothness. Full details of Besov spaces are given, for example, in 
Triebel (1992) and DeVore and Popov (1988). A wavelet ip is called r-regular 
if ip has r vanishing moments and r continuous derivatives. For a given r- 
regular mother wavelet tp with r > a and a fixed primary resolution level jo , 
the Besov sequence norm || • \\b^ q of the wavelet coefficients of a function / 
is then defined by 

/ oo \ 1/9 

(27) ii/ik,=nijip+ (E^iw) ■ 

where £ is the vector of the father wavelet coefficients at the primary 
resolution level jo , Oj is the vector of the wavelet coefficients at level j , and 
s = a + ^ — p>0. Note that the Besov function norm of index (a,p, q) of a 
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function / is equivalent to the sequence norm (27) of the wavelet coefficients 
of the function. See Meyer (1992). We define 

(28) B° q (M)={f;\\f\\ b?q <M} 
and 

(29) F« q (M,e,v) = {/:/€ B« q (M), f(t) G [e,v] for all t G [0, 1]}, 

where [e, v] with e < v is a compact set in the interior of the mean parameter 
space of the natural exponential family. 

The following theorem shows that our estimators achieve near optimal 
global adaptation under integrated squared error for a wide range of Besov 
balls. 



Theorem 2. Suppose the wavelet ip is r-regular. Let Xi ~ NQ(/(tj)),z = 
1, . . . , n,ti = j-. Let T = en 3 / 4 . Then the estimator /bjs defined in (25) sat- 
isfies 

Cn ~(2a)/(l+2a) 



sup E||/ B js-/|||< { 



,3 1\ 

p > 2, a < r ana -[a > 

2 V P) 

C n -(2a)/(l+2a) (l og n )(2-p)/(p(l+2a)) ^ 

1 



2a 



l + 2a' 



1 < p < 2, a <r and — I a J > 



P 



2a 



l + 2a 



and the estimator /nc satisfies 
sup E||/nc 

feF£ q (M,e,v) 



f\\l<C 



logn 



(2a)/(l+2a) 



p > 1) ol < r and - {a > 

P 



1 



2a 



l + 2a 



Remark 4. Note that when f(t) G [e, v], the condition / G Bp q (M) im- 
plies that there exists M' > such that G{f) G B° q (M') with 



M' = c + cM 



laj+l 

E Q^ _1 + c LQ j + i 



for some c> 0, 



where q = sup y6 [ £ ^ |G^(y)| with / = 0, . . . , [a] + 1, since it follows from 
Theorem 3 on page 344 and Remark 3 on page 345 of Runst (1986) that 



\\G(f)\\ Bs <\\G(f)\\ p 



'|aj+l 

£ ii G(i) (/)ii 



OO II •< II oo 



;l + || G W+l(/)[| c 
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Remark 5. Simple algebra shows that -) > is equivalent to 

2 i+2q^ 3 ^ p- This condition is needed to ensure that the discretization error 
over the Besov ball Bp q (M) is negligible relative to the minimax risk. See 
Section 4.2 for more discussions. 



For functions of spatial inhomogeneity, the local smoothness of the func- 
tions varies significantly from point to point and global risk given in Theorem 
2 cannot wholly reflect the performance of estimators at a point. We use the 
local risk measure 

(30) R(f(t ), /(t )) = E(/(t ) - /(to)) 2 

for spatial adaptivity. 

The local smoothness of a function can be measured by its local Holder 
smoothness index. For a fixed point to G (0, 1) and < a < 1, define the local 
Holder class A Q (M, to,<5) as follows: 

A Q (M,t , 5) = {/ : |/(t) - /(t )| < M\t - t Q \ a M t € (t - S,t + 5)}. 

If a > 1, then 

A a (M,t ,5) = {/ : |/ {LaJ) (t)-/ (LQj) (to)| < M\t-t \ a ' for t G (to — <Mo + *)}, 
where [aj is the largest integer less than a and a' = a — [a\ . Define 
F a {M, t , S,e,v) = {f:f€ A a (M, t , 5), f(x) G [e, u] for all x G [0, 1]}. 

In Gaussian nonparametric regression setting, it is a well-known fact that 
for estimation at a point, one must pay a price for adaptation. The optimal 
rate of convergence for estimating /(to) over function class A a (M, to, 5) with 
a completely known is n~ 2a ^ 1+2a \ Lepski (1990) and Brown and Low (1996) 
showed that one has to pay a price for adaptation of at least a logarithmic 
factor. It is shown that the local adaptive minimax rate over the Holder 
class A a (M,t ,S) is (logn/n) 2a ^ 1+2a l 

The following theorem shows that our estimators achieve optimal local 
adaptation with the minimal cost. 

Theorem 3. Suppose the wavelet ip is r-regular with 1/6 < a < r. Let 
t G (0,1) be fixed. Let X t ~ NQ(/(t*)), i = 1, . . . , n,U = LetT = cn 3 / i . 
Then for f = / B js or / NC 

(31) sup E(/(t ) - /(t )) 2 < C ■ 

F a (M,t ,8,e,v) 

Theorem 3 shows that both estimators are spatially adaptive, without 
prior knowledge of the smoothness of the underlying functions. 
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4.1. Regression in general natural exponential families. We have so far 
focused on the nonparametric regression in the NEF-QVF families. Our 
method can be extended to the nonparametric regression in the general 
one-parameter natural exponential families where the variance is no longer 
a quadratic function of the mean. 

Suppose we observe 

(32) Yi '~ NEF(/(tj)), i = l,...,n,U = -, 

n 

and wish to estimate the mean function f(t). When the variance is not a 
quadratic function of the mean, the VST still exists, although the mean- 
matching VST does not. In this case, we set a = b = in (3) and define H m 

as 

(33) H m (X) = G 

We then apply the same four-step procedure, Binning-VST-Gaussian Regres- 
sion-Inverse VST, as outlined in Section 3 where either BlockJS or Neigh- 
Coeff is used in the third step. Denote the resulting estimator by /bjs and 
/nc, respectively. 

The following theorem is an extension of Theorem 1 to the general one- 
parameter natural exponential families where the standard VST is used. 




Theorem 4. Let f £ F d (M,e,v). Then Y* = G(^) can be written as 
(34) Y* =G[/Wj+ e 3 + m-WZj + j = 1,2, 



j, . . . , j. , 



where Zj ' iV(0, 1), ej are constants satisfying \ej\ < c(m 1 + T d ) and 
consequently for some constant C > 



(35) ^e]<C( m -* + T~^ 

i=i 



and £j are independent and "stochastically small" random variables satisfy- 
ing that for any integer k > and any constant a > 

E|^| fe <C k \og 2k m- (m- k + T- dk ) and 

(36) 

Pflfjl >a)<C k log 2k m-(m- k + T- dk )a~ k , 
where Cj~ > is a constant depending only on k,d and M . 
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The proof of Theorem 4 is similar to that of Theorem 1. Note that the 
bound for the deterministic error in (35) is different from the one given in 
equation (18). This difference affects the choice of the bin size. 

Theorem 5. Suppose the wavelet ip is r-regular. Let Xi ~ NEF(/(tj)), i = 
1, . . . ,n,ti = jk Let T = cn 1 / 2 . Then the estimator /bjs satisfies 

C r„-(2a)/(l+2a) 



sup E||/ B j S -/|||< { 

f€F£ q (M,e,v) 



1\ 2a 
p>2,a<r ana la I > 



p) l + 2a' 

C n -(2a)/(i+2a) (log n)( 2 ~ ?, )/( p ( 1+2a )) , 

,/ l\ 2a 
1 <p <2,a <r and [a I > 



p) 1 + 2a ' 



and the estimator /nc satisfies 



sup E||/ NC -/|| 2 <C 

feF£ q (M,e,v) 



l ogn \( 2 «)/(i+ 2 -) 



n 



1\ 2a 
p>i-,a<r and [a > 



p) 1 + 2a 

Remark 6. Note that the number of bins here is T = Ofo 1 / 2 ). This 
gives a larger bin size than that needed with NEF-QVF. Because the VST 
yields higher bias than the mean-matching VST in the case of NEF-QVF, it 
is necessary to use larger bins. The condition (a — |) > is also stronger 

than the condition | (a — -) > , 2< ^ which is needed in the case of NEF- 
QVF. The functions are required to be smoother than before. This is due to 
the fact that both the approximation error and the discretization error are 
larger in this case. See Section 4.2 for more discussions. 

We have the following result on spatial adaptivity. 

Theorem 6. Suppose the wavelet ip is r-regular with ^ < a < r. Let 
t G (0,1) be fixed. Let Xi ~ NEF(/(ti)),t = 1, . . . ,n,U = £. LetT = cn 1 / 2 . 
Then for f = / B js or f NC 

(2a)/(l+2a) 



(37) sup E(/(t„) - /(to)) 2 < C 

feF a (M,t ,S,e,v) 



logn 



Remark 7. In Remark 1 we noted that some nonexponential families 
admit mean-matching variance stabilizing transformations. Although we do 
not pursue the issue in the current paper, we believe that analogs of our 
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procedure can be developed for these families and the basic results in The- 
orems 2 and 3 can be extended to such situations. A different possibility 
is that the error distributions lie in a one parameter family that admits a 
VST that is not mean matching. In that case one could expect analogs of 
Theorems 5 and 6 to be valid. 

4.2. Discussion. Our procedure begins with binning. This step makes 
the data more "normal" and at the same time reduces the number of obser- 
vations from n to T. This step in general does not affect the rate of conver- 
gence as long as the underlying function has certain minimum smoothness so 
that the bias induced by local averaging is negligible relative to the minimax 
estimation risk. While the number of observations is reduced by binning, the 
noise level is also reduced accordingly. 

An important quantity in our method is the value of T, the number of 
bins, or equivalently the value of the bin size m. The choice of T = cn 3 ^ 
for the NEF-QVF and T = cn 1 / 2 for the general NEF are determined by 
the bounds for the approximation error, the discretization error, and the 
stochastic error. For functions in the Besov ball BpJM), the discretiza- 
tion error between the sampled function {G(f(j/T)) : j = 1, . . . , T} and the 
whole function G(f(t)) can be bounded by CT~ 2d where d = (a — |) A 1 

(see Lemma 8 in Section 6.3). The approximation error Y^i=i e i can ^ e 
bounded by C{m~ 4 + T~ 2d ) as in (18). In order to adaptively achieve the 
optimal rate of convergence, these deterministic errors need to be negli- 
gible relative to the minimax rate of convergence n~( 2a ^( 1+2a ^ for all a 
under consideration. That is, we need to have m -4 = o(n~( 2a ^( 1+2a ^) and 
T~ 2d = o(n~( 2a )/( 1+2a )). These conditions put constraints on both m and 
a (and p). We choose m = ere 1 ' 4 (or equivalently T = en 3 / 4 ) to ensure that 
the approximation error is always negligible for all a. This choice also guar- 
antees that the stochastic error is under control. With this choice of m, we 

then need |) > or equivalently 2 " 14 T 2 a /3 > f • 

In the natural exponential family with a quadratic variance function, the 
existence of a mean-matching VST makes the approximation error small 
and this provides advantage over more general natural exponential families. 
For general NEF without a quadratic variance function, the approximation 
error ^ Yll=i e ? ^ s °f order m~ 2 + T~ 2d instead of m -4 + T~ 2d . Making it 
negligible for all a under consideration requires m = cn 1 / 2 . With this choice 
of m, we require a — - > or equivalently 2 -J^_^ > - in order to control 

the discretization error. In particular, this condition is satisfied if a > 1 + |. 

In this paper we present a unified approach to nonparametric regression 
in the natural exponential families and the optimality results are given for 
Besov spaces. As mentioned in the Introduction, a wavelet shrinkage and 
modulation method was introduced in Antoniadis and Sapatinas (2001) for 
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regression in the NEF-QVF and it was shown that the estimator attains the 
optimal rate over the classical Sobolev spaces with the smoothness index 
a > 1/2. In comparison to the results given in Antoniadis and Sapatinas 
(2001), our results are more general in terms of the function spaces as well 
as the natural exponential families. On the other hand, we require slightly 
stronger conditions on the smoothness of the underlying functions. It is 
intuitively clear that through binning and VST a certain amount of bias is 
introduced. The conditions |(a — ~) > in the case of NEF-QVF and 
a — - > TTTr— in the general case are the minimum smoothness condition 
needed to ensure that the bias is under control. The bias in the general NEF 
case is larger and therefore the required smoothness condition is stronger. 

5. Numerical study. In this section we study the numerical performance 
of our estimators. The procedures introduced in Section 3 are easily imple- 
mentable. We shall first consider simulation results and then apply one of 
our procedures in the analysis of two real data sets. 

5.1. Simulation results. As discussed the Section 2, there are several dif- 
ferent versions of the VST in the literature and we have emphasized the 
importance of using the mean-matching VST for theoretical reasons. We 
shall now consider the effect of the choice of the VST on the numerical 
performance of the resulting estimator. To save space we only consider the 
Poisson and Bernoulli cases. We shall compare the numerical performance of 
the mean-matching VST with those of classical transformations by Bartlett 
(1936) and Anscombe (1948) using simulations. The transformation formu- 
lae are given as follows. (In the following tables and figures, we shall use 
MM for mean-matching.) 



MM Bartlett Anscombe 

Poi(A) y/X + T/4 VX s/X + 3/8 

Bin(m,p) sin" 1 J £££ shr 1 ,/^ sin" 1 

V y m + 1/2 y m y m+3/4 

Four standard test functions, Doppler, Bumps, Blocks and HeaviSine, 
representing different level of spatial variability are used for the compari- 
son of the three VSTs. See Donoho and Johnstone (1994) for the formulae 
of the four test functions. These test functions are suitably normalized so 
that they are positive and taking values between and 1 (in the binomial 
case). Sample sizes vary from a few hundred to a few hundred thousand. 
We use Daubechies' compactly supported wavelet Symmlet 8 for wavelet 
transformation. As is the case in general, it is possible to obtain better es- 
timates with different wavelets for different signals. But for uniformity, we 
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Table 1 

Mean squared error (MSE) from 100 replications. The MSE is in units of 10 
for Bernoulli case and 10~ 2 for Poisson case 





MM 


Bartlett 


Anscombe 


MM 


Bartlett 


Anscombi 








Bernoulli 








Doppler 








Bumps 








1280 


12.117 


11.197 


12.673 


1280 


7.756 


8.631 


7.896 


5120 
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0.992 
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8.205 
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3.352 


3.160 
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68.616 


70.495 
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1.426 


1.146 


10,240 


24.427 
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0.852 


0.501 


40,960 


1.424 


1.773 


1.495 


40,960 


0.213 


0.560 


0.298 


163,840 


0.508 


0.890 


0.573 


163,840 


0.118 


0.455 


0.206 



use the same wavelet for all cases. Although our asymptotic theory only 
gives a justification for the choice of the bin size of order n 1 / 4 due to techni- 
cal reasons, our extensive numerical studies have shown that the procedure 
works well when the number of counts in each bin is between 5 and 10 for 
the Poisson case, and similarly for the Bernoulli case the average number 
of successes and failures in each bin is between 5 and 10. We follow this 
guideline in our simulation study. Table 1 reports the average squared er- 
rors over 100 replications for the BlockJS thresholding. The sample sizes are 
1280, 5120, . . . , 327,680 for the Bernoulli case and 640, 2560, . . . , 163,840 for 
the Poisson case. A graphical presentation is given in Figure 5. 

Table 1 compares the performance of three nonparametric function esti- 
mators constructed from three VSTs and wavelet BlockJS thresholding for 
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Bartlett vs. MM 



Anscombe vs. MM 
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Doppler Bumps Blocks HeaviSine 
Bartlett vs. MM 




Doppler Bumps Blocks HeaviSine 
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1.4- 



2+- 



1.4 



Doppler Bumps Blocks HeaviSine 
Anscombe vs. MM 



1 



Doppler Bumps Blocks HeaviSine 



Fig. 5. Left panels: the vertical bars represent the ratios of the MSE of the estimator using 
the Bartlett VST to the corresponding MSE of our estimator using the mean-matching 
VST. Right Panels: the bars represent the ratios of the MSE of the estimator using the 
Anscombe VST to the corresponding MSE of the estimator using the mean-matching VST. 
The higher the bar the better the relative performance of our estimator. The bars are plotted 
on a log scale and the original ratios are truncated at the value 3 for the Bartlett VST and 
at 2 for the Anscombe VST. For each signal the bars are ordered from left to right in the 
order of increasing sample size. The top row is for the Bernoulli case and the bottom row 
for the Poisson case. 



Bernoulli and Poisson regressions. The three VSTs are the mean-matching, 
Bartlett and Anscombe transformations given above. The results show the 
mean-matching VST outperforms the classical transformations for nonpara- 
metric estimation in most cases. The improvement becomes more significant 
as the sample size increases. 

In the Poisson regression, the mean-matching VST outperforms the Bartlett 
VST in 17 out of 20 cases and uniformly outperforms the Anscombe VST in 
all 20 cases. The case of Bernoulli regression is similar: the mean-matching 
VST is better than the Bartlett VST in 15 out of 20 cases and better than 
the Anscombe VST in 19 out of 20 cases. Although the mean-matching 
VST does not uniformly dominate either the Bartlett VST or the Anscombe 
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VST, the improvement of the mean-matching VST over the other two VSTs 
is significant as the sample size increases for all four test functions. The 
simulation results show that mean-matching VST yields good numerical re- 
sults in comparison to other VSTs. These numerical findings is consistent 
with the theoretical results given in Section 4 which show that the estima- 
tor constructed from the mean-matching VST enjoys desirable adaptivity 
properties. 

Table 2 reports the average squared errors over 100 replications for the 
NeighCoeff procedure in the same setting as those in Table 1. In compari- 
son to Block JS, the numerical performance of NeighCoeff is overall slightly 
better. Among the three VSTs, the mean-matching VST again outperforms 
both the Anscombe VST and Bartlett VST. 

We have so far considered the effect of the choice of VST on the per- 
formance of the estimator. We now discuss the Poisson case in more de- 
tail and compare the numerical performance of our procedure with other 
estimators proposed in the literature. As mentioned in the Introduction, 
Besbeas, De Feis and Sapatinas (2004) carried out an extensive simulation 
studies comparing several nonparametric Poisson regression estimators in- 
cluding the estimator given in Donoho (1993). The estimator in Donoho 
(1993) was constructed by first applying the Anscombe (1948) VST to the 
binned data and by then using a wavelet procedure with a global threshold 
such as VisuShrink [Donoho and Johnstone (1994)] to the transformed data 
as if the data were actually Gaussian. Figure 6 plots the ratios of the MSE 
of Donoho's estimator to the corresponding MSE of our estimator. The re- 
sults show that our estimator outperforms Donoho's estimator in all but one 
case and in many cases our estimator has the MSE less than one half and 
sometimes even one third of that of Donoho's estimator. 

Besbeas, De Feis and Sapatinas (2004) plotted simulation results of 27 pro- 
cedures for six intensity functions (Smooth, Angles, Clipped Blocks, Bumps, 
Spikes and Bursts) with sample size 512 under the squared root of mean 
squared error (RMSE). We apply NeighCoeff and BlockJS procedures to 
data with exactly the same intensity functions. The following table reports 
the RMSE of NeighCoeff and BlockJS procedures based on 100 replications: 

We compare our results with the plots of RMSE for 27 methods in Bes- 
beas, De Feis and Sapatinas (2004). The NeighCoeff procedure dominates all 
27 methods for signals Smooth and Spikes, outperforms most of procedures 
for signals Angles and Bursts, and performs slightly worse than average for 
signals Clipped Blocks and Bumps. The BlockJS procedure is comparable 
with the NeighCoeff procedure except for two signals Clipped Blocks and 
Bumps. We should note that an exact numerical comparison here is difficult 
as the results in Besbeas, de Feis and Sapatinas (2004) were given in plots, 
not numerical values. 
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Table 2 

Mean squared error (MSE) from 100 replications for the NeighCoeff thresholding. The 
MSE is in units of 10 -3 for Bernoulli case and 10 -2 for Poisson case 
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5.2. Real data applications. We now demonstrate our estimation method 
in the analysis of two real data sets, a gamma-ray burst data set (GRBs) and 
a packet loss data set. These two data sets have been discussed in Kolaczyk 
and Nowak (2005). 

Cosmic gamma-ray bursts were first discovered in the late 1960s. In 1991, 
NASA launched the Compton Gamma Ray Observatory and its Burst and 
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Fig. 6. The vertical bars represent the ratios of the MSE of Donoho's estimator to the 
corresponding MSE of our estimator. The higher the bar the better the relative performance 
of our estimator. The bars are plotted on a log scale and the original ratios are truncated at 
the value 3. For each signal the bars are ordered from left to right in the order of increasing 
sample size. 

Transient Source Explorer (BATSE) instrument, a sensitive gamma-ray de- 
tector. Much burst data has been collected since then, followed by exten- 
sive studies and many important scientific discoveries during the past few 
decades; however, the source of GRBs remains unknown [Kaneko (2005)]. For 
more details see the NASA website http:/ /www. batse.msfc.nasa.gov/batse/. 
GRBs seem to be connected to massive stars and become powerful probes 
of the star formation history of the universe. However not many redshifts 
are known and there is still much work to be done to determine the mecha- 
nisms that produce these enigmatic events. Statistical methods for temporal 
studies are necessary to characterize their properties and hence to identify 
the physical properties of the emission mechanism. One of the difficulties in 
analyzing the time profiles of GRBs is the transient nature of GRBs which 
means that the usual assumptions for Fourier transform techniques do not 
hold [Quilligan et al. (2002)]. We may model the time series data by an 
inhomogeneous Poisson process, and apply our wavelet procedure. The data 
set we use is called BATSE 551 with the sample size 7808. In Figure 7, the 
top panel is the histogram of the data with 1024 bins such that the number 
of observations in each bin would be between 5 and 10. In fact we have on 
average 7.6 observations. The middle panel is the estimate of the intensity 
function using our procedure. If we double the width of each bin, that is, 
the total number of bins is now 512, the new estimator in the bottom panel 
is noticeably different from previous one since it does not capture the fine 
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structure from time 200 to 300. The study of the number of pulses in GRBs 
and their time structure is important to provide evidence for rotation pow- 
ered systems with intense magnetic fields and the added complexity of a 
jet. 

Packet loss describes an error condition in internet traffic in which data 
packets appear to be transmitted correctly at one end of a connection, but 
never arrive at the other. So, if 10 packets were sent out, but only 8 made 
it through, then there would be 20% overall packet loss. The following data 
were originally collected and analyzed by Yajnik et al. (1999). The objec- 
tive is to understand packet loss by modeling. It measures the reliability 
of a connection and is of fundamental importance in network applications 
such as audio/video conferencing and Internet telephony. Understanding the 
loss seen by such applications is important in their design and performance 
analysis. The measurements are of loss as seen by packet probes sent at 
regular time intervals. The packets were transmitted from the University of 
Massachusetts at Amherst to the Swedish Institute of Computer Science. 
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Fig. 7. Gamma-ray burst. Histogram of BATSE 551 with 1024 bins (top panel). Esti- 
mator based on 1024 bin (middle panel). Estimator with 512 bins (bottom panel). 
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The records note whether each packet arrived or was lost. It is a Bernoulli 
time series, and can be naturally modeled as Binomial after binning the 
data. Figure 8 gives the histogram and our corresponding estimator. The 
average sum of failures in each bin is about 10. The estimator in Kolaczyk 
and Nowak (2005) is comparable to ours. But our procedure is more easily 
implemented. 

6. Proofs. In this section we give proofs for Theorems 1, 2 and 5. Theo- 
rems 3 and 6 can be proved in a similar way as Theorem 4 in Brown, Cai and 
Zhou (2008) by applying Proposition 1 in Section 6.3. We begin by proving 
Lemmas 1 and 3 as well as an additional technical result, Lemma 4. These 
results are needed to establish Theorem 1 in which an approximation bound 
between our model and a Gaussian regression model is given explicitly. Fi- 
nally we apply Theorem 1 and risk bounds for block thresholding estimators 
in Proposition 1 to prove Theorems 2 and 5. 

6.1. Proof of preparatory technical results. 

Proof of Lemma 1 . We only prove (4), the first part of the lemma. The 
proof for equation (5), the second part, is similar and simpler. By Taylor's 




Fig. 8. Packet loss data. Histogram with 2048 bins (top panel). Estimator based on the 
binned data (bottom panel). 
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expansion we write 



G 



where 

1 



X + a 
m + b 

X + a 



G(^(r ? ))=r 1 + T 2 + r 3 + r4, 



m + b 
X + a 



T 2 = -G"(j i (r,)) 



X + a 



m + b 
X + a 



m + b 



m + b ^ w 7 ' 24 

and //* is in between an d M 7 ?)- By definition, G'(n(rj)) = /(r/) -1 / 2 with 
/(r/) = n'(rj) which is also V(/x(r/)) in (2), then 

G"{ l i{r } )) l i'{r } ) = -\l{r } )-* /2 I'{r } ), 



that is, 



then 



G"^(r ] ))=-\l(r ] )-^l'{^ 



ETi = /(r?) 



. 1/2 a-fj,(rj)b 
m + b 



4 w; IA + 6 / (to + o)^ 



Note that G"{fj,{rj)) is uniformly bounded on by the assumption in the 
lemma, then we have 



E(Ti + T 2 



??? 



(38) 



(m + 6) 2 /(r ? ) 1 /2 



a-n(rj)b 



1 



1 



ml{rf) l l 2 
It is easy to show that 
1 



+ — 

1^ 



m- 



(39) 



^"(Ml))E 



X + a 
m + b 



1^1 



since |E(X/m — /u(?y)) 3 | = O(-ij-). For any e > it is known that 
X + a 



m + b 



>e\<¥{\X/m- f M(r l )\>e/2}, 



which decays exponentially fast as m — > oo [see, e.g., Petrov (1975)]. This 
implies ji* is in the interior of the natural parameter space and then G^(/j,*) 
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is bounded with probability approaching to 1 exponentially fast. Thus we 
have 

X + a , A 4 / 1 



( 40) | ET4 |< CE |_ _ M „)j = ^_| 

Equation (4) then follows immediately by combining equations (38)-(40). 
□ 

Proof of Lemma 2. The proof is similar to Corollary 1 of Zhou (2006). 
Let X = ^J m y ■ It is shown in Komlos, Major and Tusnady (1975) that there 
exists a standard normal random variable Z ~ N(0, 1) and constants e, C4 > 
not depending on m such that whenever the event A = {\X\ < Si/m} occurs, 

(41) \X-Z\<^= + ^X 2 . 

\/m \/m 



Obviously inequality (41) still holds when \X\ < £\\fm for < e\ < e. Let's 
choose £\ small enough such that &±e\ < 1/2. When \X\ < £iy/m, we have 
\X - Z\ < + \\X\ from (41), which implies \X\ - \Z\ < + l\X\ by 
the triangle inequality, that is, \X\ < + 2\Z\, so we have 

\X-Z\<^ + ^(^L + 2\Z\) 2 <c 2 Z 2 + c 3 



for some constants ci,C2 > 0. □ 

Proof of Lemma 3. By Taylor's expansion we write 

\m + bj \m + b J 2 \m + o 

Recall that |e| = |EG(^±f) - = 0(m~ 2 ) from Lemma 1, and Z is a 

standard normal variable satisfying (13), and 

(42) t = G(^-)-G(»)-e-m-^Z. 

\m + J 

We write £ = £1 + £2 + £3, where 



m + 6 m/ m(m + 6) 

£2 = G'(fi)( — — At — \/— Z ] = ^-^(X-7n / u-V^ F Z), 




X — to/j a — 6//\ 2 
m + b m + b J 
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It is easy to see that E|£i| fc < C k m~ k . Since P{|X — m\x\ > c\m} is expo- 
nentially small [cf. Komlos, Major and Tusnady (1975)], an application of 
Lemma 2 implies E|£2| fc < C k m~ k . Note that on the event — m//| < c±m}, 
G" '(/■**) is bounded for m sufficiently large, then IE | ^"3 1 fc < C k m~ k by observ- 
ing that E[(X - m^)/y/m\ 2k < C' k . The inequality E|£| fc < C k m~ k then fol- 
lows immediately by combining the moments bounds for £1, £2 an d £3. The 
second bound in (16) is a direct consequence of the first one and Markov 
inequality. □ 

The variance stabilizing transformation considered in Section 2 is for i.i.d. 
observations. In the function estimation procedure, observations in each bin 
are independent but not identically distributed. However, observations in 
each bin can be treated as i.i.d. random variables through coupling. Let 
Xi ~ NQ(//j), i = 1, . . . ,m, be independent. Here the means \i{ are "close" 
but not equal. Let X^ c be a set of i.i.d. random variables with X^ c ~ NQ(/i c ). 
We define 

D = c , ET=i x i + a \ _ G fET=i x i,c + a 



m + b J \ m + b 

If n c = maxj/ii, it is easy to see ED < since Xj jC is stochastically larger 
than Xi for all i [see, e.g., Lehmann and Romano (2005)]. Similarly, ED > 
when ji c = minj jjn . We will select a 



(43) fi* e 



mm \i\ , max /ij 



such that ED = 0, which is possible by the intermediate value theorem. In 
the following lemma we construct i.i.d. random variables Xj ]C ~ NQ(/i*) 
on the sample space of Xi such that D is very small and has negligible 
contribution to the final risk bounds in Theorems 2 and 3. 



Lemma 4. Let Xi ~ NQ(/ij), i = 1, . . . , m, be independent with /ij E [e, v], 
a compact subset in the interior of the mean parameter space of the natural 
exponential family. Assume that |minj — maxj/z,| < C8. Then there are 
i.i.d. random variables Xi >c where Xj jC ~NQ( / u*) with fi* G [minj/Xj,maxj fj^] 
such that ED = and: 

(i) 

(44) F({Xi ± X hC }) < CS; 

(ii) and for any fixed integer k>l there exists a constant C k > such 
that for all a > 0, 

E\D\ k < C k log 2k m ■ (m~ k + 5~ k ) and 

(45) 

1 2k 

P(|D| > a) < Ck ^_^(m~ k + <T fc ). 

n tx, 
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Proof, (i) There is a classical coupling identity for the Total variation 
distance. Let P and Q be distributions of two random variables X and Y on 
the same sample space, respectively, then there is a random variable Y c with 
distribution Q such that F(X ^ Y c ) = \P — Q\tv- See, for example, page 256 
in Pollard (2002). The proof of inequality (44) follows from that identity 
and the inequality that |NQ(/Xj) — NQ(^i*)|tv < C\fii — for some C > 
which only depends on the family of the distribution of Xi and [e, v] . 

(ii) Using Taylor's expansion we can rewrite D as D = G (Q m + b — — 

for some £ in between ^^+6 + ° ana ^ ^H^q~ir~^~- Since the distribution Xi 
is in exponential family, then P(maxj \Xi — A^ jC | > log 2 m) < Ck>m~ k ' for all 
k' > 0, which implies E|Xj — Xi >c \ k < CkSlog 2k m fo all positive integer k. 
Since Xi — Xi jC are independent, it can be shown that 



-Y^\Xi-X hC \\ 



— T- E ( ; 7 ) E\X\ — X\ c \ kl ■ ■ ■ E\X m — X m c \ k 
m \ u \ki,...,k m J 

:7fcE £ (hi k k ) E \ Xl ~ X hc\ k i ■■■E\Xm -X m> 



km 



j=l feiH hfc m =fc, 

Card{i,fci>l}=j 

Jq 2k k 

< C k t— VV ■ Caxd{(fci,. . ■ , k m ) : k\ H \-k m = k, 

Caxd{t,^ >!}=;} 



where the last inequality follows from the facts that k is fixed and finite and 
Card{(/ci, . . . , k m ) : k% H + k m = k, Card{i, ki > 1} = j} 



~\ Card{(/ci, . . . , kj) : k± H \- kj = k, ki > 1} 



<( m U fc <m^ fe . 



Note that 



NONPARAMETRIC REGRESSION IN EXPONENTIAL FAMILIES 

k +S k 



35 



III 



mi k 8i 



{m6) 



E 



A^- + (m5) k - j > 1 for all k > j > 1, then 

k 

<C^log 2k m-(m~ k + 6 k ). 



— \x% — Xi c \ ) 



Thus the first inequality in (45) follows immediately by observing that G"(£) 
is bounded with a probability approaching to 1 exponentially fast. The sec- 
ond bound is an immediate consequence of the first one and Markov inequal- 
ity □ 

Remark 8. The unknown function / in a Besov ball Bp q (M) has Holder 
smoothness d = min(a — ^, 1), then 5 in Lemma 4 can be chosen to be T~ d . 
The standard deviation of normal noise in equation (17) is \ j ^frri. From the 
assumptions in Theorems 2 or 3 we see m L l 2 T~ d \og 2 m converges to as a 
power of n, then 

F(\D\ > 



< C k [(m~ 1/2 log 2 m)m~ k + {^T~ d \og 2 m) k ] 



for all k > 1, 



which converges to faster than any polynomial of m. This implies the 
contribution of D to the final risk bounds in all major theorems is negligible 
as shown in later sections. 

6.2. Proof of Theorem 1. From Lemma 4, there exist Y? c where Xi c ~ 
NQ(/*) with " 



f* G 



min / 1 — I , max 

jm.+l<i<(j+l)m \n J jm+l<i<(j+l)m 



/I- 

n 



as in (43) such that 

(46) E[Y*-Y* c } = 0, 

(47) E\Y* - Y*f < C k log 2k m ■ {m~ k + T' dk ), 

1 2k 

(48) P(|Y/ - Y* c \ >a)< C k ^^{m~ k + T~ dk ). 
Lemmas 1, 2 and 3 together yield 

(49) Y * c = G(fl c ) + e j + m- 1 / 2 Z j + ^ j , j = l,2,...,T, 
and 

(50) \ej\<Cm- 2 , E\^\ k < C k m~ k and P(|^-| > a) < C k {am)~ k . 
Note that 



(51) 



G(/* c )-Gl/l J - 
Theorem 1 then follows immediately by combining equations (46)-(51). 



J 



< CT~ d . 
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6.3. Risk bound for wavelet thresholding. We collect here a few technical 
results that are useful for the proof of the main theorems. We begin with 
the following moment bounds for an orthogonal transform of independent 
variables. See Brown et al. (2010) for a proof. 

Lemma 5. Let X%, . . . ,X n be independent variables with E(Xj) = for 
i = 1, . . . ,n. Suppose that E|X;| fe < for all i and all k > with > 
some constant not depending on n. Let Y = WX be an orthogonal transform, 
of X = (Xi, . . . ,X n )'. Then there exist constants M' k not depending on n 
such that K\Yi\ k < ML for all i = 1, . . . ,n and all k > 0. 

Lemma 6 below provides an oracle inequality for block thresholding esti- 
mators without the normality assumption. 



Lemma 6. Suppose y% = 6% + Zi,i = 1, . . . ,L, where 9\ are constants and 
Zi are random variables. Let S 2 = Yli=i Hi an( ^ ^ = (I — Then 



(52) 



E||0-0|||< ||0||1 A4AL + 4E[||z|||/(||z||| > AL)]. 



PROOF. It is easy to verify that ||0 — < AL. Hence 
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<4E[||z||lJ(||z||I>AL)]. 
On the other hand, 

E[||0-0||§I(||z||l< AL)] 

< E[(2||0 - yf 2 + 2\\y - 0||l)I(||z||l < AL)] < 4AL. 



(54) 



Note that when S 2 < AL, 9 = and hence \\9- 0||| 
and S 2 > AL, 
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Hence E[||0 - 0||ll(||z||l < XL)] < ||0||§ and ( 52) follows by combining this 
with (53) and (54). □ 

The following bounds concerning a central chi-square distribution are 
from Cai (2002). 

Lemma 7. Let X ~ x\ an d A > 1. Then 

F(X > XL) < e -i/2(A-iogA-i) and 

(55) 

EXI(X > XL) < \ Le ^L/2(x~io S x^i)_ 

From (17) in Theorem 1 we can write -j7f¥* = G ^^f^ + + + -J=. 
Let (uj t k) = T~ l l 2 W ■ Y* be the discrete wavelet transform of the binned 
and transformed data. Then one may write 

( 56 ) u j,k = 0j,k + e j,k + —]= z j,k + £j,k, 

where 0'- k are the discrete wavelet transform of (G(f(i/T))/VT) which are 
approximately equal to the true wavelet coefficients of G(f), Zj t f. are the 
transform of the Z^s and so are i.i.d. iV(0,l) and e^fc and £j t k are, respec- 
tively, the transforms of (-^) and ("^)- Then it follows from Theorem 1 
that 

j k i 

and for all i > and a > we have 

E|£ ife P < C;iog 2fc m[(mn)- i / 2 + T-( <i+1 / 2 ) i ], 

(58) 

P(|&fc| > a) < ^log 2fc mK^mn)"^ 2 + (aT^ 1 / 2 )^] 

from Theorem 1 and Lemma 5. 

Lemmas 6 and 7 together yield the following result on the risk bound for 
a single block. 

Proposition 1. Let the empirical wavelet coefficients Uj^k = ^fc + e j,fc + 
'^ z j,k + ^ e given as in (56) and let the block thresholding estimator dj^k 
be defined as in (24-). Then: 

(i) for some constant C > 0, 

E £ (^-^) 2 <min|4 £ (#;. fe ) 2 , SA^n" 1 j 
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(59) 

(ii) for any < r < 1 , there exists a constant C T > depending on r only 
such that for all (j, k) G B' 1 -, 

(60) E(fy, k - e' j>k f < C T ■ min{ max {{d' j k + e^) 2 },^" 1 ) + n~ 2+T . 



The following is a standard bound for wavelet approximation error. It 
follows directly from Lemma 1 in Cai (2002). 

Lemma 8. Let T = 2 J and d = min(a — |, 1). Set 

T 1 

gj(x) = Y,^G(f(k/n))^ k (x). 
k=i v 1 

Then for some constant C > 

(61) sup \\gj-G(f)f 2 <CT- 2d . 

geF£ q (M,s) 

We are now ready to prove our main results, Theorems 2 and 5. 

6.4. Proofs of Theorems 2 and 5. We shall only prove the results for 
the estimator /bjs- The proof for /nc is similar and simpler. Let G(f) = 
max{G(/), 0} for negative Binomial and NEF-GHS distributions and G(f) = 
G(f) for other four distributions. We have 

E||/- fg = E||G- 1 [G(7)] _ G-\G(f))\\l=n{G- l )\g)[G(f) - G{f)]f 2 
<E [ V{G~\g))[G{f)-G{f)} 2 dt, 



where g is a function in between G(f) and G(f). We will first give a lemma 
which implies V(G~ 1 (g)) is bounded with high probability, then prove The- 
orems 2 and 5 by establishing a risk bound for estimating G(f). 

Lemma 9. Let G(f) be the BlockJS estimator of G(f) defined in Section 
3. Then there exists a constant C > such that 

sup P{||G(/)|| 00 >C7}<Qn- z 

feF£ q (M,E,v) 

for any I > 1, where C\ is a constant depending on I. 
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Proof. Recall that we can write the discrete wavelet transform of the 
binned data as 

1 



u 3,k = Qj.k + e hk + ~7=Zj,k + £j,k, 



n 



where 9j k are the discrete wavelet transform of ( G (f*jJJ'>) ) which are ap- 
proximately equal to the true wavelet coefficients 9jk of G(f). Note that 
\0' jk - 9 jk \ = 0(2-^ d+1 / 2 )), for d = min(a - l/p, 1). Note also that a Besov 
Ball Bp q (M) can be embedded in I^ j00 (Mi) for some M\ > [see, e.g., 
Meyer (1992)]. From the equation above, we have 

2^0 J-l 23 

E^o,*^o,fc(<) + E E^*W) e 5 -,oo(m 2 ) 
fc=i j=j k=i 

for some M 2 > 0. Applying the Block thresholding approach, we have 



1 



XLa 2 



+ 1 



1 



o2 / U J 



/-Zj.k + fj'.fc 



11 



= 0i,jk + ®2,jk + 3 ,jk for (j, fc) € , jo <j<J- 

Note that l^ijfel < |^- fe | and so 51 = 9' jo>k (f>j ,k + Y^j=j ELl Kj,k^Pj,k G 

B^ ^(il^). This implies 51 is uniformly bounded. Note that 

Tl/2 (E(s>)) 1/2 =r 1/2 -o(m" 2 ) = (i), 

so is a uniformly bounded vector. For < /3 < 1/6 and a 

constant a > we have 



P(|03j*| > a2~^ +1 / 2 )) < P(|0 3Jfe | > aT~^ +1 



/2)> 



< 



?1 



2 



for any £ > 1 by Mill's ratio inequality and equation (58). Let 



A = (J{\e 3ijk \>a2-^+W}. 

j,k 
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Then ¥(A) = C\n~ l . On the event A c we have 

J-l 2J 

= E E ^jktpj,k(t) G B£ j0O (M 3 ) for some M 3 > 0, 
i=io fe=i 

which is uniformly bounded. Combining these results, we know that for C 
sufficiently large 

(62) sup F{\\G(J)\\ 00 >C}< sup P(A) = C l n- 1 . 

f€F£ q (M,e,v) feF« q (M,e) □ 

Now we are ready to prove Theorems 2 and 5. Note that G~ l is an in- 
creasing and nonnegative function, and V is a quadratic variance function 
[see (1)]. Lemma 9 implies that there exists a constant C such that 

sup n\\V{G- l (g))\\ 00 >C}<C l n- 1 

feF« q (M,e,v) 

for any I > 1. Thus it is enough to show supf £Fa ( Mj£ „) E||G(/) — G(/)||2 < 

Cn -(2a)/(l+2a) for p > 2 an d C n -(2«)/(l+2a) ^ *)(2-p)/(p(l+2a)) f or 1 < p < 

2 under assumptions in Theorems 2 and 5. 

PROOF of Theorem 2. Let F and 9 be given as in (32) and (24), 
respectively. Then 



E||GCf)-G(/)||| = ^E(^ 0)fe -^ 



k ) 2 



k 



(63) +EE E (^-M 2 +EE^ 

3=30 k j=J k 

= S 1 +S 2 + S 3 . 

It is easy to see that the first term S\ and the third term S3 are small: 

(64) Si = 2 jo n~ l e 2 = o{n^ 2a ^ l+2a ^). 
Note that for x G R m and < p\ < P2 < 00, 

(65) ||x|| P2 < \\x\\ pi < m l / pi - l / p2 \\x\\ P2 . 

Since / G B* q (M), so 2^(^]^ =1 |6» jfe | p ) 1/p < Af. Now (65) yields that 

00 

(66) S 3 = EE^ ^ ^-^^/a-i/iO). 

3=J k 
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Proposition 1, Lemma 8 and (57) yield that 

s 2 < 2 £ £>fe - e' hk f + 2 ]T £(^, fe - 

i=io i=io k 

J-1 



<EE min { 8 E OhM^Ln-A 



(67) j i j x 

+ e E E 4* + + 10 E £%* - W 

j=io fc j=jo ft 

J-1 2»7£ r . 
-EE min l 8 E °lk> SA^Ln" 1 I + Cm" 4 + Cn' 1 + CT~ 
j=jo i=l ^ {j,k)eB} 



2,1 



which we now divide into two cases. First consider the case p>2. Let J\ = 
[j^^g 2 n]. So > 2Jl ~n 1 /( 1 + 2 «). Then (67) and (65) yield 

J-L-12P/L J-1 

s 2 < 8A* e E + 8 E E + Cn_1 + CT ~ 2d 

3=30 i=l i=Ji fc 

(68) 

< Cn -2a/(l + 2a)_ 

By combining (68) with (64) and (66), we have E||0 - 6\\ 2 2 < Cn" 2a /( 1+2a ) , 
for p > 2. □ 

Now let us consider the case p < 2. First we state the following lemma 
without proof. 

Lemma 10. Let < p < 1 and 5 = {x G R^EiU^i < > <M = 
1, . . . , fc}. T/ien sup^ £*U(:e; A A) < 6 ■ /or all A > 0. 

Let J 2 be an integer satisfying 2 J2 x n 1 /( 1 + 2a ) (i ogn )(2- P )Mi+2a) _ Note 
that 

VJL f x p/2 _2f_ 



E( E «?,*) <£(^) p/2 <m 2 - 



It then follows from Lemma 10 that 
J-1 v/L 



EE min { 8 E OlkM^Ln- 1 } 

j=J 2 i=l ^ (i,fc)GBj 
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(69) 

< Cn -(2a)/(l+2a)n og s(2-p)/(p(l+2a))_ 

On the other hand, 

3=30 i=l ^ (,i,h)eB) 

J2-1 

(70) < ^SKLn' 1 

3=30 b 

< C?i^ (2a)/(1+2a) (logn) (2 ^ p)/(p(1+2a)) . 

Putting (64), (66), (69) and (70) together yields E||f?-0||| < C n -( 2a )/( 1+2Q ) x 
(logn) (2_p)/(p(1+2a)) . 



Proof of Theorem 5. The proof of Theorem 5 is similar to that of 
Theorem 2 except the step of (67). We will thus omit most of the details. For 
a general natural exponential family the upper bound for ^2jlj Sfc e j k in- 
equation (67) is C(m" 2 +T~ 2d ) as given in Section 2, so (67) now becomes 

J-l V/L 

S 2 < Y Yl min l 8 £ 9 hi SA^Ln" 1 \ + Cm~ 2 + Cn' 1 + CT~ 2d . 

3=30 i=l ^ (i,fe)6Bj 

For m = era -1 / 2 , we have m -2 = c 2 n _1 . When a — ^ > 1 ^ a , it is easy to see 

T _2rf = o(n _2a// ( 1+2a )). Theorem 5 then follows from the same steps as in 
the proof of Theorem 2. □ 
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